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Abstract 

When triangulating a belief network we aim to obtain a junction tree of minimumstate space. According to 
[8], searching for the optimal triangulation can be cast as a search over all the permutations of the network’s 
variables. Our approach is to embed the discrete set of permutations in a convex continuous domain D. 
By suitably extending the cost function over D and solving the continous nonlinear optimization task we 
hope to obtain a good triangulation with respect to the aformentioned cost. In this paper we introduce 
an upper bound to the total junction tree weight as the cost function. The appropriatedness of this choice 
is discussed and explored by simulations. Then we present two ways of embedding the new objective 
function into continuous domains and show that they perform well compared to the best known heuristic. 
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1 Introduction. What is triangulation ? 

Belief networks are graphical representations of proba¬ 
bility distributions over a set of variables. For an intro¬ 
ductory, yet rigorous treatment of graphical probability 
models, the reader is refered to [5]. In what follows it 
will be always assumed that the variables take values in 
a finite set and that they correspond to the vertices of 
a graph. The graph’s arcs represent the dependencies 
among variables. There are two kinds of representations 
that have gained wide use: one is the directed acyclic 
graph model, also called a Bayes net , which represents 
the joint distribution as a product of the probabilities 
of each vertex conditioned on the values of its parents; 
the other is the undirected graph model, also called a 
Markov field, where the joint distribution is factorized 
over the cliques 1 of an undirected graph. This factor¬ 
ization is called a junction tree and optimizing it is the 
subject of the present paper. The power of both mod¬ 
els lies in their ability to display and exploit existent 
marginal and conditional independencies among subsets 
of variables. Emphasizing independencies is useful from 
both a qualitative point of view (it reveals something 
about the domain under study) and a quantitative one 
(it makes computations tractable). The two models dif¬ 
fer in the kinds of independencies they are able to repre¬ 
sent and often times in their naturalness in particular 
tasks. Directed graphs are more convenient for learning 
a model from data; on the other hand, the clique struc¬ 
ture of undirected graphs organizes the information in 
a way that makes it immediately available to inference 
algorithms. Therefore it is a standard procedure to con¬ 
struct the model of a domain as a Bayes net and then to 
convert it to a Markov Held for the purpose of querying 
it. 

This process is known as decomposition and it con¬ 
sists of the following stages: first, the directed graph is 
transformed into an undirected graph by an operation 
called moralization. Second, the moralized graph is tri¬ 
angulated. A graph is called triangulated if any cycle 
of length > 3 has a chord (i.e. an edge connecting two 
nonconsecutive vertices). If a graph is not triangulated 
it is always possible to add new edges so that the result¬ 
ing graph is triangulated. We shall call this procedure 
triangulation and the added edges the fill-in. In the final 
stage, the junction tree [7] is constructed from the ma¬ 
ximal cliques of the triangulated graph. We define the 
state space of a clique to be the cartesian product of the 
state spaces of the variables associated to the vertices in 
the clique and we call weight of the clique the size of this 
state space. The weight of the junction tree is the sum 
of the weights of its component cliques. All further ex¬ 
act inference in the net takes place in the junction tree 
representation. The number of computations required 
by an inference operation is proportional to the weight 
of the tree. 

For each graph there are several and usually a large 
number of possible triangulations, with widely varying 
state space sizes. Moreover, triangulation is the only 

1 A clique is a fully connected set of vertices and a maximal 
clique is a clique that is not contained in any other clique. 


stage where the cost of inference can be influenced. It is 
therefore critical that the triangulation procedure pro¬ 
duces a graph that is optimal or at least “good” in this 
respect. 

Unfortunately, this is a hard problem. No optimal tri¬ 
angulation algorithm is known to date. However, a non- 
optimal triangulation is readily obtained; a simple algo¬ 
rithm is Rose’s elimination procedure [8] which chooses 
a node v of the graph, connects all its neighbors to form 
a clique, then eliminates v and the edges incident to it 
and proceeds recursively. The resulting filled-in graph is 
triangulated. 

It can be proven that the optimal triangulation can 
always be obtained by applying Rose’s elimination pro¬ 
cedure with an appropriate ordering of the nodes. It 
follows then that searching for an optimal triangulation 
can be cast as a search in the space of all node permu¬ 
tations. The idea of the present work is the following: 
embed the discrete search space of permutations of n 
objects (where n is the number of vertices) into a suit¬ 
ably chosen continuous space. Then extend the cost to 
a smooth function over the continuous domain and thus 
transform the discrete optimization problem into a con¬ 
tinuous nonlinear optimization task. This allows one to 
take advantage of the thesaurus of optimization methods 
that exist for continuous cost functions. 

This idea is developed in section 3. Section 2 intro¬ 
duces the cost function that we used, which is an upper 
bound on the junction tree weight that is easier to com¬ 
pute over our domain. The same section also discusses 
the relationship to other objective functions for triangu¬ 
lation. Section 4 evaluates both the cost function and 
our methods by simulations. Section 5 contains final re¬ 
marks. 

2 The objective 

In this section we introduce the objective function that 
we used and we discuss its relationship to the junction 
tree weight. We also review other possible choices of cost 
functions and the previous work that is based on them. 

First we introduce some notation. Let G = (D, E ) 
be a graph, its vertex set and its edge set respectively. 
Denote by n the cardinality of the vertex set V, by r v 
the number of values of the (discrete) variable associated 
to vertex v £ V, by ff the elimination ordering of the 
nodes, such that ffv = i means that node v is the i-th 
node to be eliminated according to ordering ff, by n(i>) 
the set of neighbors of v £ V in the triangulated graph 
and by C v = (A} U {u £ n(v) | > #v} 2 , (e.g. 

the clique that formes by the elimination of v). Then, a 
result in [4] allows us to express the total weight of the 
junction tree obtained with elimination ordering ff as 

J t#) = yi ismax (^) n r u (1) 

v£V u g c v 

where ismax(C„) is a variable which is 1 when C v is a 
maximal clique and 0 otherwise. As stated, this is the 
objective of interest for belief net triangulation. Any 

2 Both n(u) and C v depend on # but we chose not to 
emphasize this in the notation for the sake of readability. 



reference to optimality henceforth will be made with re¬ 
spect to J*. 

This result implies that there are no more than n ma¬ 
ximal cliques in a junction tree and provides a method to 
enumerate them. This suggests defining a cost function 
that we call the raw weight J as the sum over all the 
cliques C v (thus possibly including some non-maximal 
cliques): 

j (#) = z) n ( 2 ) 

v£Vu£C v 

J is the cost function that will be used throughout this 
paper. 

Another objective function, used (more or less explic¬ 
itly) by [9] is the size J F of the fill-in: 

J[#) = \ F #\ ( 3 ) 

where F # is the set of edges added by the elimination al¬ 
gorithm. There exists a method, the lexicographic search 
[9], that finds minimal triangulations with respect to J F , 
but finding the minimum one is NP-hard [10]. It can be 
proven [8] that for a constant number of values r v per 
node, the minimal triangulations with respect to J F are 
also local minima for J* and J. Even if most of the local 
minima found by lexicographic search were good enough 
(something that is not supported by practical experience 
[6]), the problem with this algorithm is that it takes into 
account only topological information, ignoring the values 
of r v . As our simulation will show, this is an important 
drawback. 

Kjaerulff introduced the minimum weight (MW) 
heuristic [6], a greedy minimization method for J* (thus 
taking r v into account) and later a simulated annealing 
approach [7] that explicitly optimizes J*. 

Becker [3] introduced recently a triangulation algo¬ 
rithm which is not based on node elimination. The algo¬ 
rithm minimizes the cliguewidth J c , which is the largest 
clique-weight in the junction tree. 

J c = max c r u . (4) 

u£C 

J c is coarser than J* in the sense that triangula¬ 
tions with different values of J* can have the same 
cliqueweight. But we expect that with the increase of r v 
and of the graph density the cost of the largest clique will 
tend to dominate J* improving the agreement between 
the two criteria. Optimizing J c is provably NP-hard [1]. 

Now back to J. A reason to use it instead of J* in our 
algorithm is that the former is easier to compute and to 
approximate. But it is natural to ask how well do the 
two agree? 

Obviously, J is an upper bound for J*. Moreover, it 
can be proved that if r = min r v the following inequali¬ 
ties hold: 

< j ;„ is) 

•'<#> S hr, < (O 

J c is always lower bounding J* , with equality in the 
case of a fully connected graph. As for J, by (6) it is 


less than a fraction l/(r — 1) away from J*. The upper 
bound is attained when the triangulated graph is fully 
connected and all r v are equal. 

In other words, the differece between J and J* is 
largest for the highest cost triangulation. We also expect 
this difference to be low for the low cost triangulations, 
where our search will be concentrated. An intuitive ar¬ 
gument for this is that good triangulations are associated 
with a large number of smaller cliques rather than with 
a few large ones. Think for example of the tree rep¬ 
resented in figure 1. A tree is an already triangulated 
graph, so its best triangulation contains no fill-in edges 
and comprises n — 1 maximal cliques of size 2. However, 
it’s worst triangulation, the fully connected graph, will 
have only 1 maximal clique of size n. But the former sit¬ 
uation - many small maximal cliques - means that there 
will be only a few non-maximal cliques of small size to 
contribute to the difference J — J* , and therefore that the 
agreement with J* is usually closer than (6) implies. The 
simulations in section 4 will further support our choice. 

Note also that the reverse argument holds for J c : the 
maximum clique size is an inaccurate approximation of 
J(*#) for low cost triangulations but it is close to the true 
cost function for the worse ones, that produce a small 
number of large cliques. 

3 The continuous optimization problem 

This section shows two ways of defining J over contin¬ 
uous domains. Both rely on a formulation of J that 
eliminates explicit reference to the cliques C v and that 
will be deduced now. 

Let us first define new variables /a uv and e uv , u,v = 
1, .., n. For any permutation # 

1 if #u < #v 
0 otherwise 

_ f 1 if the edge (u,v) £ E LI F # 

Cuv ~ (0 otherwise 

In other words, fi represent precedence relationships 
and e represent the edges between the n vertices after 
the elimination. Therefore, they will be called prece¬ 
dence variables and edge variables respectively. With 
these variables, J can be expressed as 

•'<#) = in c— (?) 

v£V u£V 

The product fi vu e vu acts as an indicator variable being 1 
iff “u £ CV’ is true. For any given permutation, finding 
the fi variables is straightforward. It remains to show 
how to compute the e variables, or, in other words, how 
to find in advance, based on the precedence variables 
only, if a certain edge not in E will appear after elimina¬ 
ting the vertices in a given order. This is possible, thanks 
to a result in [9]. It states that an edge ( u , i>) is contained 
in F 1 # iff there is a path in G between u and v containing 
only nodes w for which #w < minFormally, 
e uv = e vu = 1 iff there exists a path P = ( u , uq, w 2 , . . .v) 
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Figure 1: Example illustrating inequalities (5) and (6). (b) and (c) are two triangulations of the graph represented 
in (a). For the triangulation in (b), J* is at its optimum, J is only slightly larger and J c is much smaller (by a 
factor of 1 /(n — 1)) than J*\ for the fully connected triangulation represented by (c), J* is at its maximum and J c , 
also at its maximum, is equal to it, whereas J is at the maximum possible distance (but less than a factor of 2 away) 
from J*. 


such that 

| | /* 2 , n jCi . / — 1 

Wi£P 

So far, we have succeeded to define the cost. J asso¬ 
ciated with any permutation in terms of the variables p. 
and e. In the following, the set of permutations will be 
embedded in a continuous domain. As a consequence, p. 
and e will take values in the interval [0, 1] but the form 
of J in (7) will stay the same. 

The first, method, called p.-cont.inuous embedding (p- 
CE) assumes that, the variables p. uv £ [0,1] represent, 
independent, probabilities that. #u < #v. To make an 
assignment, of the p. variables represent, a. permutation, 
we need to ensure that it. is antisymmetric and transitive. 
Antisymmetry means that #v < #u> and #u> < #v 
cannot, be true in the same time and it. is guaranteed 
by definition. Transitivity means that if < #v and 
#v < #u>, then < #u>, or, that for any triple 
(/j, uv , Hvw, H'wu) the assignments (0, 0, 0) and (1, 1, 1) are 
forbidden. We will penalize the possible violations of 
transitivity using the probabilistic interpretation of p.. 
According to it, we introduce a. penalty term that upper 
bounds the probability of a. nontransitive triplet.: 


R(v) = J^P[(w, v, w) nontransitive] (8) 

If <U1 

= y UV l^vw l^wu + (1 f^uv )(1 )(1 A*inu)] 


If <UA 

> P [assignment, non transitive] 


Recovering the permutation from a. p assignment, is easily 
done noting that 


0 < #v - 1 = y/q„, < n-1 (9) 

llfa) 


The formula, can be used for non-integer p values as well. 

In the second approach, called ^-continuous embed¬ 
ding (0-CE), the permutations are directly embedded 


into the set. of doubly stochastic matrices. A doubly 
stochastic matrix 6 is a. matrix for which the elements 
in a. row or column sum to one. 

y Ojj = y Ojj = 1 Oij > 0 for i, j = 1, ..n. (10) 

i j 

When 6ij are either 0 or 1, implying that there is exactly 
one nonzero element, in each row or column, the matrix 
is called a. permutation matrix. Oij = 1 and = j 
both mean that the position of object, i is j in the given 
permutation. The set. of doubly stochastic matrices 0 is 
a. convex polyt.ope of dimension (n — l) 2 whose extreme 
points are the permutation matrices [2]. Thus, every 
doubly stochastic matrix can be represented as a. convex 
combination of permutation matrices. To constrain the 
optimum to be a. an extreme point., we add the penalty 
term _ 

/•V'! y".ll "y! (11) 

ij 

The precedence variables are defined over 0 as 

P'VV - 1 

7*1111 = 1 — 7* fti = 1 _\ > ^ g y] j < i i i 

u, v, i, j = 1, ..n and v ^ u 

In both embeddings, the edge variables are computed 
from p. as follows 

{ 1, for (u, v) £ E or u = v 

max Yl weP Pwu^win otherwise 

P£{paths u— 

(! 2 ) 

The above assignments give the correct, values for p. 
and e for any set. of 0 values representing a permuta¬ 
tion. Over the interior of the domain, e is a. continuous, 
piecewise differentiable function. Each e uv , (u,v) fP 
can be computed by a. shortest, path algorithm between 
u and v, with the length of (uq/u'o) £ E defined as 
( 7*ui i n 7*in 2 ii) • 
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| density j 

n 

.05 

.1 

.2 

10 

20 

30 

40 

1.033 < 1.113 < 1.212 
1.037 < 1.188 < 1.447 
1.023 < 1.234 <1.662 
1.020 < 1.244 <1.825 

1.011 < 1.180 < 1.621 
1.018 < 1.221< 1.695 
1.018 < 1.236 < 1.838 
1.017 < 1.249< 1.869 

1.016 < 1.184 < 1.502 
1.012 < 1.228 < 1.829 
1.014 < 1.243 < 1.874 
1.014 < 1.244 < 1.861 


Table 1: Median values of the minimum, median and maximum values of J/J* obtained for 50,000 triangulations 
x 100 random DAGs. r v was random in the range 2-6, giving an upper bound of 2. The mean values for each graph 
were very close to the median values. 


r u . log J F log J c logj J/J* 
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Figure 2: Size of the fill-in J F (a), maximum clique J c (b) and raw weight J (c) versus J*; histogram of J/J* (d) for 
10,000 triangulations of a 30 node, 87 edges DAG. The r v values are uniform in the ranges 2-10 (I), 12-20 (II), 32-40 
(III). For thfe-.histograms in (d) the right limit of the plot is placed at the theoretical upper bound ?’ m in/(t’min — 1). 
For (a), (b), (c), “line” thinner means better agreement with J*. 


0-CE is an interior point method whereas in p.-CE the 
current point, although inside [0, l]"!" -1 >/ 2 isn’t, neces¬ 
sarily in the convex hull of the hypercube’s corners that 
represent permutations. The number of operation re¬ 
quired for one evaluation of J and its gradient is as fol¬ 
lows: 0(n 4 ) operations to compute p. from 0 , 0(n 3 logo) 
to compute e, 0(n 3 ) for and 0(n 2 ) for -|^ and -|^- 
a.fterwards. Since computing p. is the most computa¬ 
tionally intensive step, p.-CE is a clear win in terms of 
computation cost.. In addition, by operating directly in 
the p. domain, one level of approximation is eliminated, 
which makes us expect, it. to perform better than 0-CE. 
This expectation will be confirmed by the results in the 
next, faction. 

4 Experimental results 

Simulations were performed to explore the usefulness of 
J as a. cost, function and to assess the performance of our 
algorithms. 

For the first, goal, we generated random directed 


acyclic graphs (DAGs) of different, sizes and densities 3 
on which we computed the values of J* and J for 50,000 
random triangulations. For each of them, table 1 synt.e- 
sizes the results in terms of J/ J*. It. can be seen that, the 
typical values are much lower than the theoretical bound 
1 + 1 /(— 1). The large values tend t.o spread towards 
the upper bound with the increase of the number of ver¬ 
tices n, but. the median and lowest, values increase much 
slower if at. all, suggesting that, the agreement, between J 
and J* is better than theoretically predicted over a. wide 
range of graph sizes, densities and structures. 

In figure 2 we present, the relationship between 
J F , J c , J and J* for a. 30 node graph and various ranges 
for r v . The plots confirm that. J F is a. poor substitute 
for J*. It. can also be seen that. J has the best, agreement, 
with J* in all cases with J c as a. second. As predicted 
by (6) the agreement, improves when r v becomes larger. 
Regarding J c , its increase in “coarseness” with increas¬ 
ing r v is reflected by the “step-like” appearance aspect. 

3 We defined the density to be the ratio between \E\ and 
the maximum possible number of edges n(n — l)/2. 
















graph 

h9 

hl2 

dlO 

m20 

a20 

d20 

n = V r 

9 

12 

10 

20 

20 

20 

density 

.33 

.25 

.6 

.25 

.45 

.6 

^ min / T max / ^ avg 

2/2/2/ 

3/3/3 

6/15/10 

2/8/5 

6/15/10 

6/15/10 

l°gio Jmw 

2.43 

2.71 

7.44 

5.47 

12.75 

13.94 


Table 2: Characteristics of the graphs used in the experiments. 



h9 hi 2 dio m20 a20 d20 
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Figure 3: Minimum, maximum (solid line) and median (dashed line) values of -ji — obtained by 0-CE (a) and //-CE 
(b). 


of J c , whereas the expected improvement in the agree¬ 
ment with J* for large r v is not evident in the present 
simulations. 

For the second goal we compared the results of our al¬ 
gorithms with the results of the minimum weight heuris¬ 
tic (MW), the heuristic that scored best in empirical 
tests [6]. The lowest junction tree weight obtained in 
200 runs of MW was retained and denoted by 
Tests were run on 6 graphs of different sizes and densi¬ 
ties shown in table 2. 

We ran 11 or more trials of each of our two algorithms 
on each graph. To enforce the variables to converge to a 
permutation, we minimized the objective J + A R, where 
A > 0 is a parameter that was progressively increased 
following a deterministic annealing schedule and R is one 
of the aforementioned penalty terms. The algorithms 
were run for 50-150 optimization cycles, usually enough 
to reach convergence. However, for the //-embedding on 
graph d20, there were several cases where many //. values 
did not converge to 0 or 1. In those cases we picked the 
most plausible permutation to be the answer. 

The results are shown in figure 3 in terms of the ratio 
of the true cost, obtained by the continuous embedding 
algorithm (denoted by ,J *) and J* MW - For the first, two 
graphs, h9 and hl2, J% IW is the optimal cost.; the em¬ 
bedding algorithms reach it. most, trials. On the remain¬ 
ing graphs, // ("F clearly outperforms 0-CE, which also 
performs poorer than MW on average. On dlO, a20 
and m20 it. also outperforms the MW heuristic, attain¬ 
ing junction tree weights that, are 1.6 to 5 times lower on 
average than those obtained by MW. On d20, a. denser 
graph, the results are similar for MW and //.-CE in half of 
the cases and worse for //.-CE otherwise. The plots also 
show that, the variability of the results is much larger 


for CE than for MW. This behaviour is not. surprising, 
given that, the search space for CE, although continuous, 
comprises a. large number of local minima.. This induces 
dependence on the initial point, and, as a. consequence, 
nondet.erminist.ic behaviour of the algorithm. Moreover, 
while the number of choices that. MW has is much lower 
than the upper limit, of n!, the “choices” that. CE al¬ 
gorithms consider, although soft., span the space of all 
possible permutations. 

5 Discussion 

The idea, of continuous embedding is not. new in the field 
of applied mathematics. The large body of literature 
dealing with smooth (sygmoida.l) functions instead of 
hard nonlinea.rit.ies (step functions) is only one example. 
The present, paper shows a. nontrivial way of applying a. 
similar treatment, to a. new problem in a. new field. The 
results obtained by //.-embedding are on average better 
than the standard MW heuristic. Although not. directly 
comparable, the best, results reported on t.ria.ngula.t.ion 
[7, 3] are only by little better than ours. Therefore the 
significance; of our results goes beyond the scope of the 
present, problem. They are obtained on a. hard problem, 
whose cost, function has no feature to ease its minimiza¬ 
tion (J is neither linear, nor quadratic, nor is it. additive 
w.r.t. the vertices or the edges) and thus they demon¬ 
strate the potential of continuous embedding as a. general 
tool. 

The cost, function that, we have introduced, J, has the 
twofold advantage of being more accurate than all the 
other alternative cost, functions used in the literature 
and of being directly amenable to continuous approxi¬ 
mations. Sinci minimizing J may not. be NP-ha.rd, this 





opens a way for investigating new triangulation methods. 
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